Abstract—One of the more challenging real-world problems
in computational intelligence is to learn from non-stationary
streaming data, also known as concept drift. Perhaps an even
more challenging version of this scenario is when, following
a small set of initial labeled data, the data stream consists
of unlabeled data only. Such a scenario is typically referred
to as learning in an initially labeled nonstationary environment, or
simply as extreme verification latency (EVL). Because of the very
challenging nature of the problem, very few algorithms have been
proposed in the literature to date. This work is a first
effort to provide the research community with a review of some
of the prominent existing algorithms in this field.
More specifically, this paper is a comprehensive survey and
comparative analysis of some of the EVL algorithms to point out
the weaknesses and strengths of different approaches from three
different perspectives: classification accuracy, computational
complexity, and parameter sensitivity, using several synthetic
and real-world datasets.


Index Terms—Concept drift, domain adaptation, unsupervised
learning, verification latency, EVL.


I. INTRODUCTION 


THE fundamental goal in machine learning is to learn from
data. Most machine learning algorithms, regardless of the
availability of labeled data, make a fundamental assumption
that data are drawn from a fixed but unknown distribution. This
assumption implies that test or field data come from the same
distribution as the training data. In reality, this assumption
simply does not hold in many real-world problems that
generate data whose underlying distributions change over time.
Network intrusion, web usage and user interest analysis, natural 
language processing, speech and speaker identification, spam 
detection, anomaly detection, analysis of financial, climate, 
medical, energy demand, or pricing data, as well as the analysis 
of signals from autonomous robots and devices, brain signal 
analysis, and bio-informatics are just a few examples of the 
real world problems where underlying distributions may — and 
typically do — change over time. 

In machine learning, the challenge of making decisions
in a changing environment is referred to as non-stationary
learning. This is a challenging problem, because the classifier
needs to adapt to a new concept in the changing environment
while retaining the previously acquired knowledge that is still
relevant to ensure a stable learning environment, a phenomenon
commonly referred to as the stability-plasticity dilemma in the
literature [1]. The fixed distribution assumption, essentially
requiring the data to be drawn independently from an identical
distribution (also referred to as independent and identically
distributed, i.i.d.), renders traditional learning algorithms
that make this assumption ineffective at best, and misleading and
inaccurate at worst, on non-stationary distribution problems.
Concept drift techniques [2], [3], [4], [5], [6], [7] and domain
adaptation approaches [8], [9] have been developed to tackle
two related but different issues arising from non-stationary
distributions: domain adaptation techniques are designed to handle
mismatched training and test distributions over a single time
step, while concept drift approaches are designed to track
the data distributions over a streaming setting. However, both
approaches assume that there is (preferably ample) labeled
training data, and the potential scarcity or the high cost of
obtaining labeled data is a major obstacle faced by these
approaches.

In an effort to reduce the amount of required labeled data,
semi-supervised learning (SSL) and active learning (AL)
approaches have been employed. SSL approaches, of course,
also require labeled data at each time step [10], albeit in smaller
quantities. AL is another approach to combat
the limited availability of labeled data [11], where the learner
actively chooses which data instances, if labeled, would
provide the most benefit. The unavailability of labeled data,
particularly in streaming applications, gives rise to another
problem, commonly referred to as verification latency in
the literature [12], where labeled data are not available at
every time step. More specifically, verification latency refers
to the scenario where labels of the training data become
available only a certain or unspecified amount of time later,
significantly complicating the learning process. The duration
of the lag in obtaining labeled data may not be known a
priori, and/or may vary with time. The extreme case of this
phenomenon, aptly named extreme verification latency,
is perhaps the most challenging case of all machine learning
problems: labels for the training data are never available,
except perhaps those provided initially, yet the classification
algorithm is asked to learn and track a drifting distribution with
no access to labeled data. A few algorithms have been proposed in
the literature to address the
problem of extreme verification latency; however, an extensive
comparative analysis of these algorithms is missing.
Given the importance of this problem in many real-world
applications, such an analysis is needed by the
community.

The primary goal of this work is to provide a comprehensive
analysis of the extreme verification latency (EVL) learning
algorithms from three different perspectives, i.e., classification
accuracy, computational complexity, and parameter sensitivity.
To the best of our knowledge, this is the first comprehensive
work in this regard. EVL, which is also referred to as the initially
labeled non-stationary environment (ILNSE) in [13], is an
extremely important but difficult problem, and is therefore commonly
ignored by researchers. This work is an effort to motivate
researchers to provide more realistic and practical
solutions to the problem under discussion. In particular, this
work considers the following important algorithms proposed
in the literature to work under the EVL setting: i) Arbitrary
Sub-Population Tracker (APT) [14]; ii) COMPacted Object
Sample Extraction (COMPOSE) [13]; iii) Stream Classification
Algorithm Guided by Clustering (SCARGC) [15]; and iv)
Micro-cluster for Classification (MClassification) [16].


II. EXTREME VERIFICATION LATENCY LEARNING 
ALGORITHMS 


A. Arbitrary Sub-Population Tracker Algorithm (APT) 


The APT algorithm was proposed by Krempl [14] to handle
extreme verification latency under specific conditions.
The main principle underlying APT is that each
class in the data can be represented as a mixture of arbitrarily
distributed sub-populations. The APT algorithm makes the
following important assumptions:

1) The underlying population of the feature space contains
several sub-populations, each of which drifts (possibly)
differently over time;

2) Initial labeled data are used to represent each
sub-population of the feature space, where a sub-population
is defined as a mode in the class-conditional distribution
p(y|x), with p(y) representing the prior distribution of
the class labels, and p(x) representing the marginal
feature distribution;

3) The drift is gradual and "systematic", and can be
represented as a piecewise linear function;

4) The conditional posterior distribution remains fixed, i.e.,
a component's class label cannot change;

5) The covariance of each component remains constant.


The learning strategy of APT is twofold. First, the optimal
one-to-one assignment between labeled instances in time step t
and unlabeled instances in time step t + 1 is determined using
the expectation-maximization (EM) algorithm. The EM algorithm
begins with the expectation step by predicting which instances
are most likely to correspond to a given sub-population. During
the maximization step, the algorithm determines which drift
parameters maximize the expectation. Then, the classifier is
updated to reflect the population parameters of the newly
received data and the drift parameters relating the previous time
step to the current one.

Establishing a one-to-one relationship while identifying drift
requires an impractical assumption that the number of instances
remains constant throughout all time steps. Krempl relaxes this
assumption by establishing the relationship in batches,
matching a random subset of exemplars to a subset of new
observations until all new observations have been assigned
to an exemplar. Krempl also suggests a bootstrap
method that can make the one-to-one assignments more robust,
but at an additional computational cost. When the assumptions
are satisfied, APT works very well. However, APT has two
primary weaknesses: 1) some of its assumptions often do
not hold, causing a decrease in performance, and 2) it
is computationally very expensive [13].

The pseudocode for the APT algorithm is given in Algorithm 1.

Algorithm 1: Arbitrary Subpopulation Tracker (APT)


Inputs: Initial labeled data $D_{init}$; a clustering algorithm with
its own free parameters; a suitable bandwidth matrix
calculation algorithm; a suitable expectation-maximization
(EM) algorithm with its free parameters

1: Receive M training examples from $D_{init} = \{(x_i, y_i)\}$,
   $i = 1, \ldots, M$; $x \in X$; $y \in Y = \{1, \ldots, c\}$;

2: Run the clustering algorithm to partition the data into K
   disjoint subsets and associate each cluster with one class
   among the c classes;

3: Estimate the conditional feature distribution of the data;

4: Receive new unlabeled instances
   $U^t = \{x_u^t \in X, u = 1, \ldots, N\}$ and assume N = M to
   associate each new instance with one previous example;

5: Compute the instance-to-exemplar correspondence by
   maximizing the likelihood using the EM algorithm;

6: Pass the cluster assignment from the examples to their
   assigned instances to achieve instance-to-cluster assignment;

7: Pass the class $y_i$ of an example $x_i$ to its
   assigned instance;

8: Go to step 2 and repeat.
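
The correspondence and label-passing steps of Algorithm 1 can be illustrated with a deliberately simplified sketch. It assumes equal batch sizes (N = M) and replaces the EM-based likelihood maximization with a brute-force search for the one-to-one matching of minimum total squared Euclidean distance. That substitution, the function name `match_and_propagate`, and the toy data are illustrative assumptions, not Krempl's implementation.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def match_and_propagate(exemplars, labels, new_points):
    """Match each new point to exactly one exemplar and copy its label.

    Brute force over all permutations, so only feasible for tiny batches.
    This stands in for the EM-based correspondence step of APT.
    """
    if len(exemplars) != len(new_points):
        raise ValueError("one-to-one matching needs N == M")
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(exemplars))):
        # perm[u] is the index of the exemplar matched to new point u
        cost = sum(sq_dist(exemplars[e], new_points[u])
                   for u, e in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [labels[e] for e in best_perm]


# Two well-separated classes, both drifting to the right by about one unit.
exemplars = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels = [1, 1, 2, 2]
new_points = [(6.0, 5.0), (1.0, 0.1), (1.2, 0.0), (6.1, 5.2)]
predicted = match_and_propagate(exemplars, labels, new_points)
```

The factorial cost of this toy search is one way to see why the assignment step dominates the running time as batches grow.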


B. COMPacted Object Sample Extraction (COMPOSE) 


1) COMPOSE.V1 (Original COMPOSE With α-Shape Construction):
The COMPacted Object Sample Extraction (COMPOSE) framework was
introduced in [13] to address the extreme verification latency
problem. The algorithm makes only an assumption of
gradual/limited drift in the data, and consists of two important
modules: a semi-supervised learning (SSL) algorithm and the
core support extraction (CSE) module. It is an iterative procedure
that uses an SSL algorithm to label the current unlabeled
data using the initial labeled data. It then uses the core support
extraction module to construct α-shapes for each class and
thus represent the current class-conditional distribution. An
α-shape can be described as a generalization of the convex
hull of the dataset, where the convex hull of a dataset $X \subset \mathbb{R}^d$
is the convex shape with minimum area that contains all of
the observations in X, and can be described as the set of all
possible convex combinations of the points in X, or

$$\left\{ \sum_{i=1}^{|X|} a_i x_i \;\middle|\; (\forall i : a_i \ge 0) \wedge \sum_{i} a_i = 1 \right\} \tag{1}$$

for all possible $a_i$.
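
To make the convex-hull definition concrete, the following minimal sketch forms convex combinations of a small point set and checks that they stay inside the hull. The helper names `convex_combination` and `random_convex_weights` and the unit-square example are illustrative assumptions. The sketch shows only the definition in (1), not an α-shape construction.

```python
import random


def convex_combination(points, weights):
    """Return sum_i a_i * x_i after checking a_i >= 0 and sum_i a_i = 1."""
    if any(a < 0 for a in weights):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    dim = len(points[0])
    return tuple(sum(a * p[k] for a, p in zip(weights, points))
                 for k in range(dim))


def random_convex_weights(n, rng):
    """Draw n non-negative weights that sum to 1."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


# Corners of the unit square; its convex hull is the square itself.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rng = random.Random(0)
samples = [convex_combination(square, random_convex_weights(4, rng))
           for _ in range(100)]
center = convex_combination(square, [0.25, 0.25, 0.25, 0.25])
```

Every sampled point falls inside the square, and equal weights give its center.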

The α-shape is then compacted (shrunk), creating the core
support region, and instances that fall inside this region are
extracted as the core supports that represent the geometric
center (core support region) of each class distribution. These
now-labeled instances are used as the labeled information,
along with the incoming new unlabeled data, to train the SSL
algorithm during the next time step. This process is repeated
every time a new batch of unlabeled data becomes available.
The pseudocode and implementation details of the original
COMPOSE version that uses α-shape construction to extract
core supports can be seen in Algorithm 2.

COMPOSE.V1 requires the following as input: i) an SSL
algorithm such as cluster-and-label, label propagation [7],
or semi-supervised support vector machines, with relevant
free parameters; and ii) a CSE algorithm, i.e., an α-shape creation
algorithm with parameters α-shape detail level, α, and a
compaction percentage, CP, that represents the percentage of
current labeled instances to use as core supports. The algorithm
is seeded with initial labeled data $D_{init}$ in step 1. COMPOSE
starts by receiving N unlabeled instances $U^t$ in each time step.
The SSL algorithm is then trained using the current unlabeled
and labeled instances, and returns a hypothesis $h^t$ that
classifies all unlabeled instances of the current time step in
step 4. The hypothesis is then used to generate a combined set
of data, $D^t$, in step 5, and the combined data for each class is
used as the input for the CSE routine in step 8. The resulting
core supports $CS_c$, for each class c, are appended to be used
as current labeled data in the next time step in step 9.


Algorithm 2: COMPOSE.V1


Inputs: SSL algorithm - SSL with relevant free parameters;
CSE algorithm - CSE; α-shape detail level - α; compaction
percentage - CP

1: Receive initial labeled data $D_{init} = \{(x_i, y_i)\}$, $i = 1, \ldots, M$;
   $x \in X$; $y \in Y = \{1, \ldots, C\}$
   Set $L^0 = \{x_l^0\}$; initial instances
   Set $Y^0 = \{y_l^0\}$; corresponding labels of initial instances
2: for t = 0, 1, ... do
3:   Receive unlabeled data $U^t = \{x_u^t \in X, u = 1, \ldots, N\}$
4:   Run SSL with $L^t$, $Y^t$, and $U^t$
     to obtain hypothesis $h^t : X \to Y$
5:   Let $D^t = \{(x_l^t, y_l^t) : x \in L^t \ \forall l\} \cup \{(x_u^t, h^t(x_u^t)) : x \in U^t \ \forall u\}$
6:   Set $L^{t+1} = \emptyset$, $Y^{t+1} = \emptyset$
7:   for each class c = 1, 2, ..., C do
8:     Run CSE with CP, α and $D_c^t$
       to extract core supports, $CS_c$
9:     Add core supports to labeled data
       $L^{t+1} = L^{t+1} \cup CS_c$
       $Y^{t+1} = Y^{t+1} \cup \{y_u : u \in [|CS_c|], y = c\}$
10:  end for
11: end for
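
The role of the compaction percentage CP in the core support extraction step can be sketched as follows. This is not the α-shape procedure: as a simplifying assumption, the stand-in below keeps, per class, the fraction `cp` of instances closest to the class centroid. The function names and toy data are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def extract_core_supports(points, labels, cp):
    """Keep, per class, the fraction `cp` of points nearest the class centroid.

    Stand-in for the CSE module: the original compacts alpha-shapes
    rather than ranking points by distance to the centroid.
    """
    core_points, core_labels = [], []
    for c in sorted(set(labels)):
        members = [p for p, y in zip(points, labels) if y == c]
        mu = centroid(members)
        members.sort(key=lambda p: math.dist(p, mu))
        keep = max(1, round(cp * len(members)))
        core_points.extend(members[:keep])
        core_labels.extend([c] * keep)
    return core_points, core_labels


points = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0),   # class 1, with one outlier
          (5.0, 5.0), (5.0, 5.2), (5.0, 9.0)]   # class 2, with one outlier
labels = [1, 1, 1, 2, 2, 2]
core_points, core_labels = extract_core_supports(points, labels, cp=0.67)
```

With `cp=0.67`, two of the three instances per class survive and the outlying instance of each class is dropped, which mirrors the intent of shrinking each class region toward its center.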


2) COMPOSE.V2 (COMPOSE With Gaussian Mixture
Model (GMM) or Any Density Estimation Technique): One
of the central processes of COMPOSE is the core support
extraction, where the algorithm predicts which data instances
of the current environment will be useful and relevant for
classification in future time steps, where the underlying data
distributions may have changed. In the original version of
COMPOSE, α-shape construction is used for this process, but
α-shape construction is computationally very expensive,
especially as the dimensionality of the data increases.
This is because α-shape construction requires a Delaunay
tessellation of the data, and the algorithm used for this purpose
is the Quickhull algorithm [13]. This algorithm is of order
$O(n^{(d+1)/2})$, where n is the number of observations and d
is the dimensionality of the data. Hence, the algorithm is
exponential in dimensionality. In order to reduce the
computational complexity of the algorithm, we make use of the
fact that the goal of the CSE is to extract the labeled data
from each class by creating an object or shape around the
data and by compacting that object. This process is essentially
equivalent to density estimation. Therefore, more efficient
density estimation techniques can be used. One such approach is
the Gaussian mixture model (GMM), though any other density
estimation technique, such as Parzen
windows or k-nearest neighbors (kNN), can also be used here. We
observe that GMMs are significantly more
computationally efficient than α-shapes. The GMM
is a probabilistic model that describes the data
as a mixture of unimodal Gaussian distributions, and tries to
fit K Gaussians to the data X, where K is a user-specified
parameter. The probability density function is the weighted
sum of the K Gaussians as given by the following equation:


$$P(\theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mu_k, \Sigma_k) \tag{2}$$

where $\theta$ is the set of parameters describing the entire model, and
$\mu_k$, $\Sigma_k$, $\pi_k$ are the mean, covariance, and mixing coefficient
(i.e., prior probability) of each Gaussian component,
respectively.
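
As a small numerical illustration of the mixture density, the sketch below evaluates a weighted sum of Gaussians in one dimension. The univariate restriction (a scalar variance in place of a covariance matrix), the function names, and the parameter values are assumptions made for brevity. Fitting the parameters by EM is not shown.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Weighted sum of K Gaussian densities; weights are the mixing coefficients."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing coefficients must sum to 1")
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# K = 2 components with hypothetical parameters.
weights, means, variances = [0.3, 0.7], [0.0, 4.0], [1.0, 0.5]

# Riemann-sum check that the mixture integrates to roughly 1 on [-10, 10].
step = 0.01
area = sum(gmm_pdf(-10 + i * step, weights, means, variances) * step
           for i in range(2001))
```

Because the mixing coefficients sum to one, the mixture is itself a valid density, which the numerical integral confirms.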

The major advantage of using GMMs is that they are
significantly more computationally efficient than α-shapes,
particularly when d is large. The computational complexity of
the EM procedure for GMMs is difficult to quantify, because
it is an iterative procedure, but it has been shown that the
E-step and the M-step are of the order O(NKd + NK)
and O(2NKd), respectively, for each iteration, where N is
the number of observations, K is the number of mixture
components, and d is the dimensionality. Our results in chapter
5 confirm that the GMM approach is indeed substantially
faster than constructing α-shapes for any given dimensionality
and data cardinality. The pseudocode and implementation
details of COMPOSE.V2 are similar to those of COMPOSE.V1, with the
difference of using a GMM instead of α-shape construction
in the core support extraction module.
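
The gap between the two growth rates can be made tangible with a few lines of arithmetic. The sketch takes the Quickhull order to be $n^{(d+1)/2}$ and the per-iteration EM order to be the sum of the E-step and M-step terms quoted above. Constants are dropped in both, so the numbers indicate scaling only, not operation counts or runtimes. The function names and the values of n and K are hypothetical.

```python
def alpha_shape_order(n, d):
    """Growth term n^((d+1)/2) assumed for Quickhull-based construction."""
    return n ** ((d + 1) / 2)


def gmm_em_order(n, k, d):
    """Per-iteration growth term: E-step (NKd + NK) plus M-step (2NKd)."""
    return (n * k * d + n * k) + 2 * n * k * d


n, k = 1000, 3  # hypothetical batch size and number of mixture components
for d in (2, 4, 8):
    print(d, f"{alpha_shape_order(n, d):.3g}", gmm_em_order(n, k, d))
```

The first term grows exponentially with d, while the second grows only linearly.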

3) COMPOSE.V3 (Learning Extreme Verification Latency
Quickly: FAST COMPOSE): The third version of COMPOSE
modifies the core support extraction module based on the
following observation. Originally, a significant overlap of
class-conditional distributions between consecutive time steps was
thought to be the working definition of gradual/limited
drift, and hence a necessary condition for COMPOSE to work.
However, Sarnelle et al. showed that COMPOSE can
work equally well even when there is no overlap
of distributions in consecutive time steps, as long as the
distance between the unlabeled data and the core supports of
a given class is less than the distance to the nearest core
supports of any other class. We refer to this condition
as limited drift, and now distinguish it from gradual drift, which
does require an overlap of distributions in subsequent time
steps. As a result, we show that the condition of significant
overlap (or gradual drift) can be eliminated and replaced with
the more relaxed condition of limited drift. We observe that
for cases where there is no significant overlap, the core support
extraction procedure has very little impact on accuracy because
it does not change the centroids by any considerable amount, and
a clustering-based SSL algorithm can easily track the drifting
distributions using nearest centroids.

Additionally, as described above, the density estimation
procedure is impractical for high-dimensional data due to its
computational complexity. Taken together, then, an obvious
question that comes to mind is whether the density-estimation-based
core support extraction is needed at all. To answer this
question, we removed the core support extraction procedure
of COMPOSE entirely, and all instances labeled by the
semi-supervised algorithm are then used as "core supports," i.e., the
most representative instances for the future time steps. We call
this modified version of the algorithm FAST COMPOSE [20].

The pseudocode and implementation details of FAST COMPOSE
are shown in Algorithm 3. FAST COMPOSE only
requires an SSL algorithm with its relevant free parameters
as an input. The algorithm begins by receiving M initially
labeled instances, $L^0$, and corresponding labels $Y^0$, of C
classes in step 1. The algorithm then receives a new set of N
unlabeled instances $U^t$. The SSL algorithm is then executed
given the current unlabeled and labeled instances to obtain the
hypothesis $h^t$ of the current time step in step 4. The hypothesis
is then used to label the data for the next time step, as shown
in steps 5 to 8 of Algorithm 3.
Algorithm 3: FAST COMPOSE


Input: SSL algorithm - SSL with relevant free parameters
1: Receive labeled data
   $L^0 = \{x_l^t \in X\}$,
   $Y^0 = \{y_l^t \in Y = \{1, \ldots, C\}, l = 1, \ldots, M\}$
2: for t = 0, 1, ... do
3:   Receive unlabeled data $U^t = \{x_u^t \in X, u = 1, \ldots, N\}$
4:   Run SSL with $L^t$, $Y^t$, and $U^t$
     to obtain hypothesis $h^t : X \to Y$
5:   Let $D^t = \{(x_u^t, h^t(x_u^t)) : x \in U^t \ \forall u\}$
6:   Set $L^{t+1} = \emptyset$, $Y^{t+1} = \emptyset$
7:   for each class c = 1, 2, ..., C do
8:     $CS_c = \{x : x \in D_c^t\}$, and add to labeled data for
       next time step
       $L^{t+1} = L^{t+1} \cup CS_c$
       $Y^{t+1} = Y^{t+1} \cup \{y_u : u \in [|CS_c|], y = c\}$
9:   end for
10: end for
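
The loop of Algorithm 3 can be sketched in a few lines. The SSL step is left pluggable in the algorithm. Here a nearest-class-centroid rule is swapped in as a simple stand-in, and the function names and drifting toy data are hypothetical. The sketch only illustrates how every newly labeled instance is carried forward as labeled data.

```python
import math


def nearest_centroid_ssl(labeled, labels, unlabeled):
    """Stand-in SSL step: label each point by its nearest class centroid."""
    centroids = {}
    for c in set(labels):
        members = [p for p, y in zip(labeled, labels) if y == c]
        centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return [min(centroids, key=lambda c: math.dist(x, centroids[c]))
            for x in unlabeled]


def fast_compose(initial_points, initial_labels, batches, ssl=nearest_centroid_ssl):
    """Label each batch, then reuse the whole labeled batch at the next step."""
    labeled, labels = list(initial_points), list(initial_labels)
    history = []
    for batch in batches:
        predicted = ssl(labeled, labels, batch)
        history.append(predicted)
        # No core support extraction: every newly labeled instance is kept.
        labeled, labels = list(batch), list(predicted)
    return history


# Two classes 3 units apart, both drifting right by 1 unit per time step.
initial_points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
initial_labels = [1, 1, 2, 2]
batches = [[(1.0 * t, 0.5), (1.0 * t + 3.0, 0.5)] for t in range(1, 6)]
history = fast_compose(initial_points, initial_labels, batches)

# A classifier frozen at t = 0 mislabels the last batch.
static = nearest_centroid_ssl(initial_points, initial_labels, batches[-1])
```

In this toy stream the per-step drift is smaller than half the class separation, so the limited drift condition holds and the labels are tracked correctly, whereas the frozen classifier assigns both final instances to class 2.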


C. Stream Classification Algorithm Guided by Clustering 
(SCARGC) 


SCARGC is a clustering-based algorithm proposed by Souza
et al. [15] to deal with the extreme verification latency problem. It
repeatedly clusters unlabeled input data, and then classifies the
clusters using the labeled clusters from the previous time step.
SCARGC also makes several assumptions:


1) A small amount of labeled data is available initially to
define the problem;

2) The drift is gradual/incremental, which allows tracking
of the classes with only unlabeled information. The
incremental drift assumption as used in SCARGC requires
significant overlap between class distributions in
subsequent time steps and short intervals of time;

3) The number of classes is known and fixed ahead of time.


Given the aforementioned assumptions, the algorithm builds
an initial classification model using the available labeled data
from c classes, and then divides the initial labeled data into
$k \ge c$ clusters, where k is a user-selected free parameter.
If the user selects k = c, SCARGC uses the c classes as initial
clusters; otherwise, a clustering subroutine finds clusters and
associates each cluster with one class. Souza denotes this
initial set of k clusters as $C^0 = \{C_1^0, C_2^0, \ldots, C_k^0\}$. As new
unlabeled data are received, the algorithm stores each example
in a pool and predicts its label using the initial classification
model. After a fixed number of examples, also pre-determined
by the user, have been received and stored in the pool, the pool
of examples is clustered into k clusters in the same way
as the initial labeled data, i.e., by using the c classes
as initial clusters if k = c, and otherwise by running a clustering
subroutine to associate each cluster with one class. The new
set of clusters is denoted as $C^1 = \{C_1^1, C_2^1, \ldots, C_k^1\}$. Each
new cluster $C_i^1 \in C^1$ is then associated with (linked to) one
of the previous clusters $C_j^0 \in C^0$ to assign each cluster
to one class. The classification model is updated using the
recently labeled examples. The algorithm then repeats the
loop, alternating between clustering and classification. The
labels are decided by associating clusters $C^t$ in the current
iteration with the labels of clusters $C^{t-1}$ from the previous
iteration. The mapping between the clusters is performed by
centroid similarity between current and previous iterations
using Euclidean distance. Given the current centroids from
the most recent unlabeled clusters and past centroids from the
previously labeled clusters, a one-nearest-neighbor algorithm (or
a support vector machine) is used to label the centroids of the
current unlabeled clusters.


SCARGC is computationally efficient, but its performance
is highly dependent on the clustering phase. It also requires
some prior knowledge, such as the number of classes and the
number of modes for each class in the data, the latter of which
may limit the use of this algorithm when such information
is not available. The pseudocode for the SCARGC algorithm is
given in Algorithm 4.
Algorithm 4: SCARGC


Inputs: Initial training data $D_{init}$; maximum pool size N;
number of clusters k

1: Receive initial labeled data $D_{init} = \{(x_i, y_i)\}$, $i = 1, \ldots, M$;
   $x \in X$; $y \in Y = \{1, \ldots, c\}$

2: Build initial classifier $\phi$ using $D_{init}$

3: Run the k-means clustering algorithm to divide the data into k
   clusters $C^t = \{C_1^t, C_2^t, \ldots, C_k^t\}$ and associate each cluster
   with one of the c classes

4: Start receiving new unlabeled examples from the unlabeled
   data stream $U = \{x_u \in X\}$

5: Store the next batch of N examples in a pool

6: Predict labels of stored examples using classifier $\phi$ as
   $D_{new} = \{(x_u, \phi(x_u))\}$, $u = 1, \ldots, N$

7: Run the k-means clustering algorithm on $D_{new}$ to obtain
   $C^{t+1} = \{C_1^{t+1}, C_2^{t+1}, \ldots, C_k^{t+1}\}$

8: Establish a mapping between current and previous clusters:
   the current clusters $C^{t+1}$ are associated with previous clusters
   $C^t$ by measuring similarity between their centroids $q_i^t$,
   $i = 1, \ldots, k$, using Euclidean distance, i.e., $Dist(q^t, q^{t+1})$,
   where Dist represents Euclidean distance

9: Assign each current centroid $q_i^{t+1}$ the label $\hat{y}_i$, which is the
   same as the label $y_j$ of the closest past centroid $q_j^t$

10: The current dataset now has the updated labels
    from the previous step as $D_{t+1} = \{(x_u, \hat{y}_u)\}$, $u = 1, \ldots, N$

11: Update the classifier $\phi$ using $D_{t+1}$
12: Go to step 4 and repeat
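
The centroid-mapping and relabeling steps of Algorithm 4 can be sketched as below, using the one-nearest-neighbor rule on centroids. The clustering itself is assumed to have been run already, so the current centroids and per-example cluster indices are hypothetical inputs, as are the function names.

```python
import math


def label_current_centroids(past_centroids, past_labels, current_centroids):
    """Give each current centroid the label of its nearest past centroid (1-NN)."""
    assigned = []
    for q_new in current_centroids:
        nearest = min(range(len(past_centroids)),
                      key=lambda j: math.dist(q_new, past_centroids[j]))
        assigned.append(past_labels[nearest])
    return assigned


def relabel_pool(pool, cluster_ids, centroid_labels):
    """Propagate each cluster's label to the pooled examples it contains."""
    return [(x, centroid_labels[k]) for x, k in zip(pool, cluster_ids)]


past_centroids = [(0.0, 0.0), (4.0, 4.0)]
past_labels = ["a", "b"]
# Hypothetical clustering output for the pool: centroids and a cluster index per example.
current_centroids = [(4.5, 4.2), (0.6, 0.3)]
pool = [(0.5, 0.2), (4.4, 4.1), (0.7, 0.4), (4.6, 4.3)]
cluster_ids = [1, 0, 1, 0]

centroid_labels = label_current_centroids(past_centroids, past_labels, current_centroids)
relabeled = relabel_pool(pool, cluster_ids, centroid_labels)
```

The relabeled pool would then be used to update the classifier. Note that the cluster indices returned by the clustering step carry no class meaning until this mapping is applied.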


D. Micro-Cluster for Classification (MClassification) 


Souza et al. also proposed MClassification, an algorithm
that uses the idea of micro-clusters (MCs) to adapt to
changes in the data over time and to learn concepts under
extreme verification latency. A micro-cluster (MC) is a compact
representation of the data points $x_i$, $i = 1, \ldots, N$, that
includes the sufficient statistics of the data and is represented
as a triplet $(N, LS, SS)$, where N is the number of data points
in the cluster, LS is the linear sum of the N data points,
$LS = x_1 + \cdots + x_N$, and SS is the squared sum
of the data points, $SS = x_1^2 + \cdots + x_N^2$.
Thus an MC summarizes the information about the set of N data
points, from which we can calculate the centroid and radius
of the MC using the following equations:

$$centroid = \frac{LS}{N} \tag{3}$$

$$radius = \sqrt{\frac{SS}{N} - \left(\frac{LS}{N}\right)^2} \tag{4}$$
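
As a quick worked check of these statistics, consider the three one-dimensional points 1, 2, and 3. The radius is taken here to be the root-mean-square deviation from the centroid, $\sqrt{SS/N - (LS/N)^2}$, which is the usual micro-cluster definition and is stated as an assumption:

$$
\begin{aligned}
N &= 3, \qquad LS = 1 + 2 + 3 = 6, \qquad SS = 1^2 + 2^2 + 3^2 = 14, \\
centroid &= \frac{LS}{N} = \frac{6}{3} = 2, \\
radius &= \sqrt{\frac{14}{3} - 2^2} = \sqrt{\tfrac{2}{3}} \approx 0.816 .
\end{aligned}
$$

Both quantities follow from the triplet alone, so the individual points never need to be stored.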


Although MC is efficient and appropriate for data streaming 
problems, the authors observe that MC representation has 
been commonly used in clustering problems. In order to use 
MC to classify evolving data streams, the authors modify the 
representation to store information about the class of data 
points, thus their representation is a 4-tuple (N, LS, SS, y), 
where y is the label for a set of data points. The working of 
the algorithm is presented below. 


The algorithm begins by receiving the initial labeled data
$D_{init}$, from which it builds a set of labeled MCs, where each
MC has information about only one example. The algorithm
then starts receiving the unlabeled data stream. A label $\hat{y}_t$ is
then predicted for each example $x_t$ from the stream based on
its nearest MC, computed with respect to Euclidean distance
in the classification phase. The example $x_t$ is added to its
corresponding nearest MC, say $MC_N$. The updated radius
of $MC_N$ is then computed, and the algorithm checks whether the updated
radius of $MC_N$ exceeds the maximum micro-cluster radius
threshold r defined by the user. If the radius does not exceed
the threshold r, the example $x_t$ remains in $MC_N$ and its
updated centroid is also computed. The centroid of the
updated MC, i.e., $MC_N$, is therefore moved slightly in the direction
of the newly emerging concept of the class of the new example.
On the other hand, if the radius exceeds the threshold,
a new MC carrying the predicted label $\hat{y}_t$ is created
to hold the new example $x_t$. The process is repeated for
each newly received unlabeled example. The pseudocode for the
MClassification algorithm with the implementation details is
provided in Algorithm 5.


Algorithm 5: MClassification


Inputs: Maximum micro-cluster radius r
1: Receive initial labeled data $D_{init} = \{(x_i, y_i)\}$;
   $x \in X$; $y \in Y = \{1, \ldots, c\}$
2: Build T micro-clusters as $MC_i = (N_i, LS_i, SS_i, y_i)$,
   $i = 1, \ldots, T$, where N = number of data points;
   $LS = \sum_{j=1}^{N} x_j$; $SS = \sum_{j=1}^{N} (x_j)^2$
3: Calculate the sufficient statistics of each micro-cluster as
   follows: $centroid_i = LS_i / N_i$, $i = 1, \ldots, T$
4: Receive one new unlabeled example $x_t$ from the unlabeled
   data stream $U = \{x_t \in X\}$
5: Measure the distance between $x_t$ and each micro-cluster
   centroid $centroid_i$, $i = 1, \ldots, T$, i.e., $Dist(centroid_i, x_t)$,
   to find the closest micro-cluster, say $MC_N$, where Dist
   represents the Euclidean distance
6: Assign the label of $MC_N$, i.e., $\hat{y}_t$, to classify example $x_t$
7: Add example $x_t$ to $MC_N$ and compute its sufficient
   statistics $radius_N$ and $centroid_N$
8: if $radius_N > r$ then
9:   Create a new micro-cluster for example $x_t$, carrying the
     predicted label $\hat{y}_t$
10: else
11:   Add example $x_t$ to $MC_N$ and update its statistics
      as $LS_N \leftarrow LS_N + x_t$; $SS_N \leftarrow SS_N + (x_t)^2$;
      $N_N \leftarrow N_N + 1$
12: end if
13: Go to step 4 and repeat
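
A compact sketch of this procedure is given below. It stores LS and SS per dimension and takes the radius to be the root-mean-square deviation from the centroid, which is an assumption about the exact radius definition. Instead of adding the example and then undoing the addition, it computes the radius the cluster would have after absorbing it, which leads to the same decision. The class and function names and the toy stream are hypothetical.

```python
import math


class MicroCluster:
    """Labeled micro-cluster stored as the 4-tuple (N, LS, SS, y)."""

    def __init__(self, point, label):
        self.n = 1
        self.ls = list(point)             # per-dimension linear sum
        self.ss = [v * v for v in point]  # per-dimension squared sum
        self.label = label

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius_if_added(self, point):
        """Radius the cluster would have after absorbing `point`."""
        n = self.n + 1
        var = sum((ss + v * v) / n - ((ls + v) / n) ** 2
                  for ls, ss, v in zip(self.ls, self.ss, point))
        return math.sqrt(max(var, 0.0))

    def add(self, point):
        self.n += 1
        self.ls = [s + v for s, v in zip(self.ls, point)]
        self.ss = [s + v * v for s, v in zip(self.ss, point)]


def classify_stream(initial_points, initial_labels, stream, r):
    """Predict a label per streamed example and update the micro-clusters."""
    clusters = [MicroCluster(p, y) for p, y in zip(initial_points, initial_labels)]
    predictions = []
    for x in stream:
        nearest = min(clusters, key=lambda mc: math.dist(x, mc.centroid()))
        predictions.append(nearest.label)
        if nearest.radius_if_added(x) > r:
            clusters.append(MicroCluster(x, nearest.label))  # too wide: new MC
        else:
            nearest.add(x)                                   # absorb, centroid shifts
    return predictions, clusters


stream = [(0.2, 0.0), (0.4, 0.0), (5.2, 5.0), (3.0, 0.0)]
predictions, clusters = classify_stream(
    [(0.0, 0.0), (5.0, 5.0)], ["a", "b"], stream, r=0.5)
```

The first three examples are absorbed by their nearest micro-clusters. The last one would inflate the radius beyond r, so it starts a third micro-cluster that inherits the predicted label.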


E. LEVEL$_{IW}$: Learning Extreme VErification Latency
With Importance Weighting

LEVEL$_{IW}$ is based on the observation that the
importance-weighting-based domain adaptation used for covariate shift
is related to concept drift problems, though algorithms for each
make different assumptions. Concept drift algorithms typically
make a gradual (or at least limited) drift assumption,
but do not require stationary posteriors or shared support, while
covariate shift assumes that the class-conditional distributions at
consecutive time steps share support and that posterior distributions
do not change.

More specifically, the authors of LEVEL$_{IW}$ observed that COMPOSE
originally assumed a significant distribution overlap at
consecutive time steps, allowing instances lying in the center of
the feature space to be used as the most representative labeled
instances from the current time step to help label the new data
at the next time step. Such an assumption is also inherent
in importance-weighting-based domain adaptation, but only
for a single time step with mismatched training and test data
distributions. They therefore explore importance weighting not
for matching training and test distributions at a single time step, but
rather for matching distributions between two consecutive time
steps, and estimate the posterior distribution of the unlabeled
data using the importance weighted least squares probabilistic
classifier (IWLSPC) [21]. The estimated labels are then
iteratively used as the training data for the next time step. They
call this algorithm LEVEL$_{IW}$: Learning Extreme VErification
Latency with Importance Weighting. The pseudocode and
implementation details of this approach are described below
and summarized in Algorithm 6. LEVEL$_{IW}$ uses
IWLSPC as a subroutine [21], and hence serves as a wrapper
approach.


Algorithm 6: LEVEL$_{IW}$
Inputs: Importance weighted least squares probabilistic
classifier - IWLSPC; kernel bandwidth value $\sigma$
1: At t = 0, receive initial data $x \in X$ and the corresponding
   labels $y \in Y = \{1, \ldots, C\}$.
   Set $x_{te}^{t=0} = x$
   Set $y_{te}^{t=0} = y$
2: for t = 1, ... do
3:   Receive new unlabeled test data $x_{te}^t \in X$
4:   Set $x_{tr}^t = x_{te}^{t-1}$
5:   Set $y_{tr}^t = y_{te}^{t-1}$
6:   Call IWLSPC with $x_{te}^t$, $x_{tr}^t$, $y_{tr}^t$, and $\sigma$ to estimate $y_{te}^t$
7: end for


Initially, at t = 0, LEVEL$_{IW}$ receives data x with their
corresponding labels y, initializes the test data $x_{te}^{t=0}$ to the initial
data x received, and sets their corresponding labels $y_{te}^{t=0}$ equal
to the initial labels y. Then, the algorithm iteratively processes
the data, such that at each time step t, a new unlabeled test
dataset $x_{te}^t$ is first received; the previously unlabeled test data
from the previous time step, $x_{te}^{t-1}$, which are now labeled by the
IWLSPC subroutine, become the labeled training data $x_{tr}^t$ for
the current time step; and similarly the labels $y_{te}^{t-1}$ obtained by
IWLSPC during the previous time step become the labels $y_{tr}^t$ of
the current training data $x_{tr}^t$. The training data at the current
time step $x_{tr}^t$, the corresponding label information at the current
time step $y_{tr}^t$, the kernel bandwidth value $\sigma$, and the unlabeled
test data at the current time step $x_{te}^t$ are then passed on to the
IWLSPC algorithm, which predicts the labels $y_{te}^t$ for the
unlabeled test data. The entire process is then iteratively repeated.
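
The wrapper structure described here can be sketched independently of the classifier it wraps. In the sketch below, a plain 1-nearest-neighbor rule is swapped in where IWLSPC would be called. It performs no importance weighting and ignores the bandwidth, so it illustrates only how test data and estimated labels are handed over between time steps. All function names and data are hypothetical.

```python
import math


def one_nn_classifier(x_tr, y_tr, x_te, sigma=None):
    """Stand-in for IWLSPC: plain 1-nearest-neighbor; `sigma` is ignored."""
    predictions = []
    for x in x_te:
        j = min(range(len(x_tr)), key=lambda i: math.dist(x, x_tr[i]))
        predictions.append(y_tr[j])
    return predictions


def level_wrapper(x_init, y_init, test_batches, classifier, sigma):
    """At each step, train on the previous batch with its estimated labels."""
    x_te_prev, y_te_prev = list(x_init), list(y_init)
    estimates = []
    for x_te in test_batches:
        x_tr, y_tr = x_te_prev, y_te_prev           # previous test data become training data
        y_te = classifier(x_tr, y_tr, x_te, sigma)  # IWLSPC would be called here
        estimates.append(y_te)
        x_te_prev, y_te_prev = list(x_te), y_te
    return estimates


x_init = [(0.0,), (1.0,), (5.0,), (6.0,)]
y_init = [0, 0, 1, 1]
# Both classes shift right by 1 per time step.
test_batches = [[(v[0] + t,) for v in x_init] for t in range(1, 5)]
estimates = level_wrapper(x_init, y_init, test_batches, one_nn_classifier, sigma=1.0)
```

Because each step trains only on the immediately preceding batch, any labeling error made at one step is inherited by the next, which is why the quality of the wrapped classifier matters.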

We also note that two other algorithms have been proposed
more recently to work in the EVL setting. The first is
TRACE [22], which tracks the trajectory of the clusters over
time using a trajectory prediction algorithm, for instance a
Kalman filter, instead of tracking clusters using unsupervised
learning algorithms as done in COMPOSE and SCARGC. The second is
Affinity-based COMPOSE [23], which is based on COMPOSE
with a slight modification in the core support extraction module:
it uses as labeled information only those samples from
the previous time step that have
the highest similarity scores with the unlabeled samples at the
current time step, computed from the affinity matrix. These
algorithms are useful contributions to the literature; however, we
observe that TRACE does not show any statistically significant
improvement over the other algorithms already proposed in
the literature, and Affinity-based COMPOSE shows only an
incremental improvement over FAST COMPOSE in some of
the experiments designed by its authors. For
these reasons, we do not include these two algorithms in our
experiments.


III. EXPERIMENTS & RESULTS


We analyze the algorithms' behavior from three different
perspectives: the average classification accuracy shown in Table I,
the computational complexity of these algorithms as measured by
runtime on a fixed system shown in Table II, and a more detailed
parameter sensitivity analysis shown in Tables V, VI, VII and
VIII. Our analyses here include SCARGC, MClassification, COMPOSE
and LEVEL_IW. The arbitrary sub-population tracker (APT) was not
included in the analyses, as this algorithm's steep computational
complexity made it prohibitive to run on some of the larger
datasets. This behavior of APT was also previously reported in
[13], even on a simple bi-dimensional problem. The analysis of this
algorithm in [13], where it was originally compared to COMPOSE,
also revealed another significant shortcoming: APT requires all
modes of the data distribution to be present at initialization, and
hence cannot accommodate scenarios where a distribution splits into
multiple modes, or vice versa, over time. Taken together, these two
concerns render APT less competitive than the other algorithms in
real world scenarios, and hence it was not included in further
analysis. The reason we described the working of the APT algorithm
in detail in Section II is that it was the very first algorithm to
introduce the problem of extreme verification latency to the
computational intelligence research community, and thus motivated
other researchers in the field to propose efficient algorithms for
the EVL setting.

The results discussed below are organized by algorithm, discussing
the observations made for each algorithm under evaluation in
comparison to the others.


A. Analysis of Three Versions of COMPOSE 


1) Accuracy Comparison: Average accuracy results comparing all
three versions of COMPOSE (COMPOSE with α-shapes, COMPOSE with GMM,
and FAST COMPOSE) do not show any significant difference among
them, or among any of the other algorithms, as shown in Table III.
We do observe, however, that FAST COMPOSE, while not better with
statistical significance at the 0.05 level, does perform
consistently better on most datasets than the other algorithms, and
provides the lowest overall average rank (a lower rank indicates
better performance: rank 1 is the best algorithm and rank 7 is the
worst).

TABLE I: Average classification accuracy (%), with the per-dataset rank in parentheses

DATASETS | COMPOSE (α-shape) | COMPOSE (GMM) | FAST COMPOSE | SCARGC (1-NN) | SCARGC (SVM) | MClassification | LEVEL_IW
1CDT | 99.96(2) | 99.85(5) | 99.97(1) | 99.69(7) | 99.72(6) | 99.89(4) | 99.92(3)
1CHT | 99.60(2) | 99.34(6) | 99.57(3) | 99.69(1) | 99.27(7) | 99.38(5) | 99.52(4)
1CSurr | 90.95(5) | 89.72(6) | 95.64(1) | 94.53(3) | 94.99(2) | 85.15(7) | 91.30(4)
2CDT | 96.58(1) | 95.92(2) | 95.17(4) | 87.71(6) | 87.82(5) | 95.23(3) | 58.32(7)
2CHT | 90.39(1) | 89.63(2) | 89.41(3) | 83.62(5) | 83.39(6) | 87.93(4) | 52.15(7)
4CE1CF | 93.92(5) | 93.90(6) | 93.95(4) | 94.04(3) | 92.79(7) | 94.38(2) | 97.74(1)
4CR | 99.99(2.5) | 99.99(2.5) | 99.99(2.5) | 99.96(6) | 98.94(7) | 99.98(5) | 99.99(2.5)
4CRE-V2 | 92.59(1) | 92.30(3) | 92.46(2) | 91.34(6) | 91.46(5) | 91.59(4) | 24.10(7)
FG_2C_2D | 87.90(6) | 95.50(5) | 95.58(3) | 95.51(4) | 95.60(2) | 62.48(7) | 95.71(1)
GEARS_2C_2D | 90.98(7) | 95.83(3) | 91.26(6) | 95.99(2) | 95.81(4) | 94.73(5) | 97.74(1)
MG_2C_2D | 93.12(2) | 93.20(1) | 93.02(3) | 92.92(5) | 92.94(4) | 80.58(7) | 85.44(6)
UG_2C_2D | 95.63(3) | 95.71(1) | 95.61(5) | 95.65(2) | 95.62(4) | 95.28(6) | 74.34(7)
UG_2C_3D | 94.92(3) | 95.20(1) | 95.12(2) | 94.83(5) | 94.91(4) | 94.72(6) | 64.69(7)
UG_2C_5D | 92.07(2) | 92.13(1) | 91.99(3) | 91.38(4) | 90.94(6) | 91.25(5) | 80.17(7)
keystroke | 84.31(7) | 87.21(5) | 85.92(6) | 88.07(3.5) | 88.07(3.5) | 90.62(1) | 90.56(2)
Average Rank (lower is better) | 3.2813 | 3.4688 | 3.1563 | 4.1563 | 4.8438 | 4.5000 | 4.5938

TABLE II: Average execution time (in seconds), with the per-dataset rank in parentheses

DATASETS | COMPOSE (α-shape) | COMPOSE (GMM) | FAST COMPOSE | SCARGC (1-NN) | SCARGC (SVM) | MClassification | LEVEL_IW
1CDT | 19.18(6) | 4.21(3) | 1.15(1) | 10.20(4) | 2.50(2) | 64.75(7) | 15.02(5)
1CHT | 19.76(6) | 4.04(3) | 1.17(1) | 10.76(4) | 3.29(2) | 62.36(7) | 15.34(5)
1CSurr | 72.84(6) | 7.32(2) | 2.53(1) | 51.78(5) | 16.40(3) | 220.49(7) | 43.83(4)
2CDT | 20.21(6) | 2.89(2) | 1.46(1) | 10.00(4) | 3.34(3) | 62.48(7) | 15.71(5)
2CHT | 19.59(6) | 3.55(3) | 1.41(1) | 10.09(4) | 2.89(2) | 60.77(7) | 15.79(5)
4CE1CF | 241.16(6) | 44.14(2) | 8.41(1) | 210.97(5) | 134.56(3) | 775.59(7) | 137.82(4)
4CR | 213.51(6) | 55.90(2) | 12.04(1) | 91.22(4) | 56.22(3) | 608.00(7) | 148.32(5)
4CRE-V2 | 216.55(5) | 34.82(2) | 6.44(1) | 280.27(6) | 41.51(3) | 641.46(7) | 147.81(4)
FG_2C_2D | 229.34(5) | 16.04(2) | 3.80(1) | 587.19(6) | 54.58(3) | 870.12(7) | 185.77(4)
GEARS_2C_2D | 237.24(5) | 14.45(2) | 2.50(1) | 609.95(7) | 26.91(3) | 497.87(6) | 186.42(4)
MG_2C_2D | 228.96(5) | 15.38(2) | 4.26(1) | 583.76(6) | 53.44(3) | 740.75(7) | 190.81(4)
UG_2C_2D | 115.30(5) | 16.92(2) | 3.45(1) | 152.24(6) | 23.27(3) | 362.48(7) | 72.69(4)
UG_2C_3D | 936.18(7) | 15.64(2) | 2.60(1) | 747.96(5) | 62.28(3) | 881.07(6) | 176.53(4)
UG_2C_5D | 2138.39(7) | 15.97(2) | 2.65(1) | 849.03(5) | 265.92(4) | 977.53(6) | 176.84(3)
keystroke | 31761.70(7) | 2.02(4) | 1.16(3) | 0.82(2) | 0.68(1) | 6.62(6) | 2.30(5)
Average Rank (lower is better) | 5.8125 | 2.3750 | 1.1250 | 4.9375 | 2.6875 | 6.7500 | 4.3125
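The fractional ranks in Table I (for example 2.5 in the 4CR row and 3.5 in the keystroke row) are consistent with the usual convention of giving tied algorithms the mean of the ranks they jointly occupy. The sketch below reproduces that convention; the helper name `average_ranks` is ours, and we are assuming, not quoting, the ranking procedure used to build the table.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank values from 1 to n; tied values share the mean of the ranks they span."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2.0
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks

# Accuracies of the seven algorithms on 4CR, in the column order of Table I.
row_4cr = [99.99, 99.99, 99.99, 99.96, 98.94, 99.98, 99.99]
print(average_ranks(row_4cr))
```

For execution times, where smaller is better, the same helper can be called with `higher_is_better=False`. Averaging the per-dataset ranks down each column then gives the "Average Rank" row.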


2) Computational Complexity Comparison: Computational complexity
(measured as runtime in seconds) among the three versions of
COMPOSE, as well as the other algorithms, also provides some useful
and interesting results. As shown in Table II, COMPOSE (α-shape) is
the second worst algorithm in terms of computational complexity
after MClassification, and performs significantly worse than all
other algorithms except SCARGC (1-NN), MClassification and LEVEL_IW
(from which it shows no significant difference).

For COMPOSE (α-shape), the curse of dimensionality is the biggest
bottleneck, as can be seen from the very large computation time it
requires for two datasets with even modestly high dimensionality:
the 5-dimensional dataset UG_2C_5D and the 10-dimensional real
world dataset keystroke. We can see that the computational cost of
COMPOSE (α-shape) grows exponentially with dimensionality, and the
algorithm is therefore impractical for high dimensional datasets.
With respect to execution time, COMPOSE (GMM) shows significant
improvement over COMPOSE (α-shape), SCARGC (1-NN), and
MClassification, as seen in Table IV. FAST COMPOSE shows
significant improvement over all other algorithms except COMPOSE
(GMM) and SCARGC (SVM). We observe that FAST COMPOSE also performs
consistently better on most datasets with respect to computation
time, providing the lowest average rank as shown in Table II. FAST
COMPOSE thus emerges as the fastest algorithm for handling extreme
verification latency.


3) Parameter Sensitivity Comparison: In addition to classi- 
fication accuracy and runtime based computational complexity, 
we also investigated the parameter sensitivity of these algo- 
rithms. Parameter sensitivity analysis measures the robustness 
of a given algorithm’s performance in response to changes in 
the algorithm’s most influential free parameters. In general, we 
prefer stable algorithms, whose performances do not change 
wildly for modest changes in their free parameters. 


COMPOSE.V1 and COMPOSE.V2 employ two modules, namely core support
extraction and semi-supervised learning (SSL), each requiring its
own free parameters. The primary free parameters for COMPOSE.V1
are the α-shape detail level α, the α-shape compaction percentage
CP, and the number of clusters k for the cluster-and-label SSL
algorithm. COMPOSE.V2 requires the number of Gaussian mixture
components K, the compaction percentage CP, and the number of
clusters k for the cluster-and-label SSL algorithm. All these
parameters normally require fine tuning in order to give good
results. COMPOSE.V3, i.e., FAST COMPOSE, was introduced primarily
to reduce the computational complexity of the algorithm, but it
also reduces the number of free parameters by removing the core
support extraction module, and hence requires only the number of
clusters k. Therefore, we perform the sensitivity analysis of
COMPOSE with respect to this parameter, which is common to all
three versions of COMPOSE. Table VII shows the results obtained by
COMPOSE using cluster-and-label, where for each dataset we provide
the COMPOSE performance with the optimal k value, as well as with
k incorrectly chosen by just ±1. This ±1 represents the smallest
possible change in k around its optimal value. For example, if the
optimal value is k = 4, the three values of k used for comparison
are k = 3, k = 4, and k = 5. When the optimal k is two, the
selection of k = 1 is, of course, meaningless, as k = 1 would
result in all instances being classified into the same class.
Hence, such cases are indicated as N/A in Table VII. We observe
that cluster-and-label is able to identify the structure in the
data from few labeled instances, and it does so reasonably well
even when there is overlap among the clusters. However, this
performance is subject to the correct choice of the number of
clusters k, to which it tends to be rather sensitive: in most
datasets, changing the value of k from its optimal value by even
just 1 substantially, and in some cases catastrophically, reduces
the average accuracy for that dataset.
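To make the role of k concrete, the following is a minimal sketch of a cluster-and-label step: labeled and unlabeled points are clustered together, and each cluster passes the majority label of its labeled members to its unlabeled members. It is an illustration under our own assumptions (plain k-means with a deterministic farthest-point initialisation, hypothetical function names), not the exact SSL routine used inside COMPOSE.

```python
import math
from collections import Counter

def kmeans_assign(points, k, n_iter=25):
    """Plain k-means with deterministic farthest-point initialisation.
    Returns the cluster index assigned to each point in the last iteration."""
    centers = [tuple(points[0])]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign

def cluster_and_label(x_lab, y_lab, x_unl, k):
    """Label x_unl with the majority label of the labeled points in the same cluster."""
    assign = kmeans_assign(list(x_lab) + list(x_unl), k)
    n_lab = len(x_lab)
    fallback = Counter(y_lab).most_common(1)[0][0]   # for clusters with no labeled point
    cluster_label = {}
    for j in range(k):
        votes = Counter(y for y, a in zip(y_lab, assign[:n_lab]) if a == j)
        cluster_label[j] = votes.most_common(1)[0][0] if votes else fallback
    return [cluster_label[a] for a in assign[n_lab:]]
```

With k = 1 every point falls into a single cluster and therefore receives one label, which is why the reduced-k entry is marked N/A whenever the optimal k is two.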

In summary, then, there is no statistically significant difference
among any of the algorithms with respect to classification accuracy
(though FAST COMPOSE consistently performs better). FAST COMPOSE
and COMPOSE with GMM are significantly better in terms of runtime,
and LEVEL_IW appears to be more robust to parameter variations than
the other algorithms.


B. Analysis of SCARGC 


1) Accuracy Comparison: We included two versions of SCARGC, one
using nearest neighbor (1-NN) and the other using support vector
machines (SVM), neither of which showed any significant difference
from any of the other algorithms in terms of classification
accuracy, as shown in Table III.

2) Computational Complexity Comparison: With respect to execution
time, SCARGC (1-NN) does not show significant improvement over any
algorithm, while SCARGC (SVM) shows significant improvement over
COMPOSE (α-shape) and MClassification, as shown in Table IV. FAST
COMPOSE showed a significant improvement over SCARGC (1-NN), as
previously discussed and as can also be seen in Table IV. The
computational performance of FAST COMPOSE is not significantly
better than that of SCARGC (SVM); however, FAST COMPOSE does take
less computation time than SCARGC (SVM) on almost every dataset.

3) Parameter Sensitivity Comparison: SCARGC has three input
parameters: the initial labeled data, the pool size, and the number
of clusters. Its authors show that SCARGC is robust to changes in
the initial labeled data and the pool size (the number of instances
in each batch evaluated by the algorithm at any given time).
Therefore, we fixed the pool size to be equal to the batch size
(drift interval shown in Table 1) used in all versions of COMPOSE
and LEVEL_IW to ensure a fair comparison. As with all algorithms,
we also assume that the entire initial batch of data is labeled,
followed by entirely unlabeled data. This allows all algorithms to
see the exact same data in each batch. The third parameter, the
number of clusters k, is the more useful one to test with respect
to parameter sensitivity. For the sensitivity analysis, we followed
a procedure similar to the one used for COMPOSE: we evaluated
SCARGC using the optimal k value, as well as k incorrectly chosen
by just ±1, as shown in Table V. We observed that the performance
of SCARGC is also quite sensitive to the correct choice of this
parameter, as the performance drops dramatically for incorrect
choices of k, particularly for the cases with class overlap.
Overestimating k relative to its optimal value does not hurt the
performance much for those datasets that do not have class overlap,
though, perhaps not surprisingly, underestimating it does
negatively impact the classification accuracy.


C. Analysis of MClassification 


1) Accuracy Comparison: MClassification behaves similarly to the
other algorithms in terms of classification accuracy when averaged
across all datasets, and does not provide any significant
difference, as shown in Table III. However, it performs worse than
every other algorithm on two specific datasets, FG_2C_2D and
MG_2C_2D. Looking in more depth at the evolution of the classes
over the time steps, we notice that for both of these datasets
there is a sudden change in the modes or clusters representing the
two classes at some time step, co-occurring with class overlap,
which causes the drop in performance. The other algorithms do see a
small drop in their performance when the overlap occurs (as
expected), but they do not lose track of the clusters when the
positions and parameters of the class distributions suddenly
change.

2) Computational Complexity Comparison: With respect to execution
time, this algorithm appears to be the worst algorithm (other than
APT), with the highest average rank (highest being the worst and
lowest being the best), as shown in Table II. As shown in Table IV,
MClassification takes significantly longer to run than all other
algorithms, except COMPOSE with α-shape (and perhaps APT), with
which the difference is not significant.

3) Parameter Sensitivity Comparison: From the parameter sensitivity
perspective, we first note that MClassification was introduced by
the authors of SCARGC as an alternative that is claimed to use a
less sensitive parameter and to require no prior knowledge to tune.
The only parameter this algorithm uses is the maximum micro-cluster
radius threshold r, a user-defined parameter that the authors claim
is quite robust. The authors further argue that the value r = 0.1
generally works well in all cases. In order to test this claim, we
evaluated the algorithm on eight different values of r, namely
0.01, 0.05, 0.1, 0.2, 0.5, 1, 1.5 and 2; the results are given in
Table VI. For each dataset, three values of r are reported: the
smallest value tested (0.01), the claimed default value (r = 0.1),
and the largest value of r, namely the one at which the algorithm
starts to see a drop in its performance. We observed that for all
datasets except MG_2C_2D, the lower values of 0.01 and 0.05 make
little difference to the performance relative to the optimal value.
However, the performance does not remain consistent when values
greater than the optimal value are used: increasing the threshold
decreases the performance, and the decrease is more dramatic for
the datasets that possess significant class overlap.

TABLE III: Statistical significance at α = 0.05 for classification accuracy

 | COMPOSE (α-shape) | COMPOSE (GMM) | FAST COMPOSE | SCARGC (1-NN) | SCARGC (SVM) | MClassification | LEVEL_IW
COMPOSE (α-shape) | n/a | | | | | |
COMPOSE (GMM) | | n/a | | | | |
FAST COMPOSE | | | n/a | | | |
SCARGC (1-NN) | | | | n/a | | |
SCARGC (SVM) | | | | | n/a | |
MClassification | | | | | | n/a |
LEVEL_IW | | | | | | | n/a

TABLE IV: Statistical significance at α = 0.05 for execution time (↑: the row algorithm is significantly slower than the column algorithm; ↓: significantly faster)

 | COMPOSE (α-shape) | COMPOSE (GMM) | FAST COMPOSE | SCARGC (1-NN) | SCARGC (SVM) | MClassification | LEVEL_IW
COMPOSE (α-shape) | n/a | ↑ | ↑ | | ↑ | |
COMPOSE (GMM) | ↓ | n/a | | ↓ | | ↓ |
FAST COMPOSE | ↓ | | n/a | ↓ | | ↓ | ↓
SCARGC (1-NN) | | ↑ | ↑ | n/a | | |
SCARGC (SVM) | ↓ | | | | n/a | ↓ |
MClassification | | ↑ | ↑ | | ↑ | n/a | ↑
LEVEL_IW | | | ↑ | | | ↓ | n/a
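The effect of the radius threshold r on MClassification can be illustrated with a small sketch of a micro-cluster update. We assume here the common clustering-feature summary (count N, linear sum LS, squared sum SS) and a simple absorb-or-create rule. The class and function names are hypothetical, and the details may differ from the update rule in the original MClassification paper.

```python
import math

class MicroCluster:
    """Summary of a group of points: count n, linear sum ls, squared sum ss."""
    def __init__(self, point, label):
        self.n = 1
        self.ls = list(point)
        self.ss = [v * v for v in point]
        self.label = label

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius_if_added(self, point):
        """Root of the summed per-dimension variance after adding `point`."""
        n = self.n + 1
        var = 0.0
        for s, q, v in zip(self.ls, self.ss, point):
            mean = (s + v) / n
            var += max((q + v * v) / n - mean * mean, 0.0)
        return math.sqrt(var)

    def add(self, point):
        self.n += 1
        self.ls = [s + v for s, v in zip(self.ls, point)]
        self.ss = [q + v * v for q, v in zip(self.ss, point)]

def classify_and_update(clusters, point, r):
    """Give `point` the label of the nearest micro-cluster, then either absorb it
    (if the radius stays within r) or start a new micro-cluster with that label."""
    nearest = min(clusters, key=lambda c: math.dist(c.centroid(), point))
    label = nearest.label
    if nearest.radius_if_added(point) <= r:
        nearest.add(point)
    else:
        clusters.append(MicroCluster(point, label))
    return label
```

Under this rule a large r lets a micro-cluster keep absorbing points and grow, so it can spread across a class boundary when classes overlap, while a small r creates many tight micro-clusters. This is consistent with the drop in accuracy observed for larger values of r.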


IV. ANALYSIS OF LEVEL_IW


1) Accuracy Comparison: The average classification accuracy of
LEVEL_IW across all datasets was, as previously mentioned, not
statistically significantly different from that of the remaining
algorithms, as shown in Table III. However, we observe that
LEVEL_IW performs rather poorly specifically on datasets with
significant between-class overlap, as can be seen from Table I. The
reason for this relatively poor performance can be traced to the
assumptions made by domain adaptation algorithms: significant
between-class overlap coupled with a drifting environment
ultimately leads to a significant change in the posterior
probability distribution p(y|x) of the classes, violating one of
the covariate shift assumptions behind domain adaptation algorithms
in general, and LEVEL_IW in particular. We note that the ability of
the other algorithms to perform well even under significant
between-class overlap is in fact due to a crucial piece of
information provided to them through one of their free parameters.
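For reference, the covariate shift assumption mentioned above is usually written as follows. This is the standard textbook form. The subscripts tr and te for the training and test distributions, and the symbol w for the importance weight, are our notation and are not taken from [21].

```latex
p_{tr}(x) \neq p_{te}(x), \qquad
p_{tr}(y \mid x) = p_{te}(y \mid x), \qquad
w(x) = \frac{p_{te}(x)}{p_{tr}(x)}
```

Importance weighting corrects only for the change in the input distribution through w(x). When drift combined with class overlap changes p(y|x) as well, the second equality no longer holds, and reweighting the training data cannot compensate for it.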

2) Computational Complexity Comparison: With respect to execution
time, LEVEL_IW is significantly slower than FAST COMPOSE only, as
shown in Table IV.


3) Parameter Sensitivity Comparison: Since LEVEL_IW is a wrapper
around the IWLSPC algorithm, the free parameters of LEVEL_IW are
the same as those of IWLSPC, i.e., the regularization parameter λ
and the kernel bandwidth parameter σ. The more influential free
parameter for LEVEL_IW is the kernel width σ used in the Gaussian
kernel, and we therefore performed the sensitivity analysis for
this parameter. The kernel width does not provide any direct
information on the number of clusters, but rather on the overall
smoothness of the decision boundaries. Such information, while not
terribly useful after a complete overlap, provides more protection
against, and less sensitivity to, minor or even moderate changes in
its value. To see this effect, a parameter sweep was chosen to
cover a range commonly known to work well in other algorithms that
use Gaussian kernels, and includes the values 0.01, 0.1, 0.2, 0.5,
1, 1.2, 1.5, 2, and 5. In Table VIII, we show the performance of
LEVEL_IW for each of the datasets with three different values of σ,
representing the smallest and largest values of σ for which the
algorithm performs well, as well as an additional value between the
two. We observe that LEVEL_IW is surprisingly robust to such wide
fluctuations of σ, typically five fold and sometimes as wide as an
order of magnitude. This outcome shows the consistent and stable
performance of LEVEL_IW, its most prominent advantage over the
remaining algorithms.


V. ANALYSIS ON TWO ADDITIONAL REAL WORLD 
DATASETS 


The Keystroke dataset that was included in all aforementioned
experiments is the only real world dataset in the original
benchmark. That benchmark was used in part because it was used by
other algorithms, allowing a fair comparison of our results to
those reported in their respective publications [14], [13], [15],
[16], [20]. We also had access to two additional real world
datasets, which we used separately and on which we evaluated all
four main groups of algorithms. In this section we discuss the
behavior of these algorithms on these two datasets, namely the
Weather and Traffic datasets. For this analysis, among the three
versions of COMPOSE, we use COMPOSE.V3 (FAST COMPOSE) because of
its fewer parameter requirements and reduced computational
complexity, and because in general it works as well as or better
than the previous two versions.

Fig. 1: Accuracy of algorithms on real world weather data

The Weather dataset was created by [13] and is based on raw data
obtained from the National Oceanic and Atmospheric Administration
(NOAA). The raw data were collected over a 50-year span at Offutt
Air Force Base in Bellevue, Nebraska. Eight features (temperature,
dew point, sea-level pressure, visibility, average wind speed, max
sustained wind speed, and minimum and maximum temperature) are used
to determine whether each day experienced precipitation (rain) or
not. The dataset contains 18,159 daily readings, of which 5,698 are
rain and the remaining 12,461 are no rain. Hence, this dataset has
moderate class imbalance, with 68.62% of the instances belonging to
class 1 and 31.38% belonging to the other class. Data were grouped
into 49 batches of one-year intervals, each containing 365
instances (days); the remaining data were placed into the fiftieth
batch as a partial year. The imbalance inherent in the overall
data, combined with consistent and significant class overlap,
caused all algorithms to classify all data into one class, giving a
(false sense of) accuracy of 69% on this dataset, as shown in
Figure 1. Therefore, the results on this dataset are inconclusive.
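The 69% figure is what a classifier obtains by always predicting the majority class. The short calculation below, using only the counts quoted above, makes this explicit. The function and variable names are ours.

```python
def majority_baseline(class_counts):
    """Accuracy (in %) of always predicting the most frequent class."""
    total = sum(class_counts.values())
    return 100.0 * max(class_counts.values()) / total

# Class counts of the Weather dataset as quoted in the text.
weather = {"no rain": 12461, "rain": 5698}
weather_baseline = majority_baseline(weather)

# 49 full one-year batches of 365 days; the rest forms the fiftieth batch.
full_batches, batch_size = 49, 365
last_batch = sum(weather.values()) - full_batches * batch_size

print(f"Majority-class baseline: {weather_baseline:.2f}%")
print(f"Instances in the fiftieth batch: {last_batch}")
```

An accuracy near this baseline therefore says little about whether an algorithm is tracking the drift, which is why the Weather results are treated as inconclusive.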

The second real dataset we use in our analysis is the Traffic
dataset, first introduced in [24]. This dataset consists of 5,412
instances, 512 real-valued attributes and 2 classes, representing
whether a traffic intersection is busy (has cars in the
intersection) or empty. The images in this dataset were captured by
a fixed traffic camera continuously observing an intersection over
a two-week period. Some sample images from this dataset are shown
in Figure 2.

The concept drift in this dataset is due to the ambient changes in
the scene that occur because of variations in illumination,
shadows, fog, snow, or even light saturation from oncoming cars.
We observe that this dataset is also imbalanced, but not as
severely as the Weather dataset: out of 5,412 instances, 3,168
(58.54%) belong to class 1, while 2,244 (41.46%) belong to the
other class. While the overall data do not have significant
imbalance, dividing the data into batches does add significant
imbalance to certain batches. The imbalance becomes increasingly
significant as the number of batches grows.

Figure 3 shows the performance of each algorithm on this dataset
with different numbers of batches. Figure 3(a) shows the
classification accuracy of SCARGC for 5, 10, 15, 18, and 20
batches; Figure 3(b), Figure 3(c), and Figure 3(d) show the same
information for MClassification, FAST COMPOSE and LEVEL_IW,
respectively. We observe that all algorithms achieve around 76%
classification accuracy, so long as the number of batches is less
than or equal to 18.
Fig. 2: Sample images of traffic scenes streaming from a traffic
camera


The only minor exception is the MClassification algorithm, which
performs equally well even if the data are divided into more than
18 batches (for instance, 20 batches, as seen in Figure 3(b)). We
attribute this behavior to the online nature of this algorithm: it
can process data one instance at a time, and hence is not affected
by the batch size. With all other algorithms, the problem with
batch size can be linked to class imbalance: if the data are split
into 20 batches, ten batches contain on average a 68% to 32%
imbalance between the classes, while the other ten batches contain
an imbalance on average equal to that of the overall data, i.e.,
58.54% to 41.46%. These results further confirm a common
shortcoming of concept drift algorithms that are asked to work
under extreme verification latency: they are sensitive to class
imbalance.


VI. CONCLUSION


This paper provides a comprehensive evaluation of existing
approaches that learn from a nonstationary (drifting) environment
experiencing extreme verification latency, with respect to
classification accuracy, computational complexity and parameter
sensitivity. In a nonstationary streaming environment, the data,
drawn from a drifting distribution, arrive in a streaming manner.
Extreme verification latency places the additional constraint
that, beyond an initial batch, the entire data stream is assumed to
be unlabeled.

The most important contribution of this work is the comprehensive
and comparative analysis of the algorithms available in the
literature for handling extreme verification latency from three
different perspectives: classification accuracy, computational
complexity and parameter sensitivity. Our goal has been to
determine and describe the relative strengths and weaknesses of
these algorithms, and to point out the cases and scenarios where
one algorithm is better suited than the others. The original
COMPOSE algorithm, COMPOSE with α-shape (COMPOSE.V1), was a
significant contribution to the field when it was first proposed,
as it was the only algorithm capable at the time of addressing the
problem of learning in nonstationary environments in the presence
of extreme verification latency with no restrictions on the nature
of the data distribution. However, that capability came at a steep
price: the algorithm is computationally very expensive (though
still more efficient than the arbitrary sub-population tracker
(APT) as well as MClassification). The algorithm also proved to be
quite sensitive to the choice of its primary free parameters.
Despite these shortcomings, and despite several other competing
algorithms developed since then, the original COMPOSE algorithm
remains competitive with respect to classification accuracy. The
second version of COMPOSE, COMPOSE with GMM (COMPOSE.V2), replaced
the α-shape based approach for determining the core supports with a
Gaussian mixture model based density estimation module that
dramatically increased its computational efficiency while retaining
the classification accuracy of COMPOSE.V1. The latest version of
COMPOSE, i.e., FAST COMPOSE, further improves the classification
accuracy as well as the computational efficiency compared to all
other algorithms. One remaining issue with FAST COMPOSE, however,
is its sensitivity to the choice of its primary free parameter, the
number of clusters in the cluster-and-label based SSL algorithm it
uses.

TABLE V: Accuracy with three different values of k (SCARGC)

DATASETS | Reduced k (Accuracy) | Optimal k (Accuracy) | Increased k (Accuracy)
1CDT | N/A | k=2 (99.72) | k=3 (99.72)
1CHT | N/A | k=2 (99.27) | k=3 (99.22)
1CSurr | k=4 (91.68) | k=5 (94.99) | k=6 (91.66)
2CDT | N/A | k=2 (87.82) | k=3 (51.99)
2CHT | N/A | k=2 (83.39) | k=3 (67.48)
4CE1CF | k=4 (2.15) | k=5 (92.79) | k=6 (49.67)
4CR | k=3 (25.33) | k=4 (98.94) | k=5 (98.94)
4CRE-V2 | k=3 (24.82) | k=4 (91.46) | k=5 (39.72)
FG_2C_2D | k=3 (68.49) | k=4 (95.60) | k=5 (94.91)
GEARS_2C_2D | N/A | k=2 (95.81) | k=3 (88.06)
MG_2C_2D | k=3 (64.87) | k=4 (92.94) | k=5 (82.76)
UG_2C_2D | N/A | k=2 (95.62) | k=3 (57.19)
UG_2C_3D | N/A | k=2 (94.91) | k=3 (80.20)
UG_2C_5D | N/A | k=2 (90.94) | k=3 (75.08)
keystroke | k=9 (57.43) | k=10 (88.07) | k=11 (58.07)

TABLE VI: Accuracy with three different values of r (MClassification)

DATASETS | Lowest r (Accuracy) | Middle r (Accuracy) | Highest r (Accuracy)
1CDT | r=0.01 (99.85) | r=0.1 (99.89) | r=2 (97.85)
1CHT | r=0.01 (99.23) | r=0.1 (99.38) | r=2 (92.97)
1CSurr | r=0.01 (84.80) | r=0.1 (85.15) | r=0.5 (48.67)
2CDT | r=0.01 (94.76) | r=0.1 (95.23) | r=0.5 (55.84)
2CHT | r=0.01 (86.50) | r=0.1 (87.93) | r=0.5 (56.37)
4CE1CF | r=0.01 (94.59) | r=0.1 (94.38) | r=2 (96.21)
4CR | r=0.01 (99.98) | r=0.1 (99.98) | r=1 (23.02)
4CRE-V2 | r=0.01 (91.21) | r=0.1 (91.59) | r=0.5 (27.80)
FG_2C_2D | r=0.01 (59.20) | r=0.1 (62.48) | r=0.2 (55.84)
GEARS_2C_2D | r=0.01 (95.23) | r=0.1 (94.73) | r=0.3 (93.90)
MG_2C_2D | r=0.01 (51.10) | r=0.1 (80.58) | r=0.2 (74.41)
UG_2C_2D | r=0.01 (95.12) | r=0.1 (95.28) | r=0.5 (51.87)
UG_2C_3D | r=0.01 (94.57) | r=0.1 (94.72) | r=0.5 (52.44)
UG_2C_5D | r=0.01 (91.31) | r=0.1 (91.25) | r=1 (68.17)
keystroke | r=0.01 (90.62) | r=0.1 (76.90) | r=0.2 (73.86)

SCARGC was developed as a competing algorithm to the original
COMPOSE, with the primary advantage of better computational
efficiency. SCARGC with nearest neighbor (1-NN) shows accuracy
comparable to the other algorithms and is less computationally
expensive than COMPOSE (α-shape) and MClassification (but not
COMPOSE with GMM or FAST COMPOSE); it too is sensitive to the
choice of its primary free parameter, the number of clusters k in
the k-means clustering subroutine it uses. SCARGC with SVM, while
reasonable with respect to computational burden, was found to be
the worst (highest average rank) in terms of classification
accuracy. SCARGC with SVM retains the high parameter sensitivity of
SCARGC (1-NN).

MClassification shows accuracy comparable to the other algorithms
but appears to be the worst algorithm in terms of computational
complexity (other than APT), requiring more runtime than even
COMPOSE with α-shape on most datasets. This behavior is attributed
to its online nature. In fact, MClassification is the only
algorithm capable of processing the data in an online manner, a
distinct advantage in a streaming environment, but that advantage
appears to be unrealized or wasted due to the heavy computational
burden. This algorithm is also quite sensitive to its primary free
parameter.

LEVEL_IW, as with the other algorithms, performed comparably with
respect to classification accuracy, and is less computationally
expensive than COMPOSE.V1, SCARGC (1-NN), and MClassification (but
more expensive than FAST COMPOSE, SCARGC (SVM) and COMPOSE.V2).
While not the best performing algorithm in terms of either
classification accuracy or computational efficiency, LEVEL_IW has
one advantage over the other algorithms: greater robustness and
stability with respect to relatively wide fluctuations in the value
of its primary free parameter.

TABLE VII: Accuracy with three different values of k (COMPOSE)

DATASETS | Reduced k (Accuracy) | Optimal k (Accuracy) | Increased k (Accuracy)
1CDT | N/A | k=2 (99.85) | k=3 (99.76)
1CHT | N/A | k=2 (99.34) | k=3 (98.72)
1CSurr | k=3 (85.58) | k=4 (94.55) | k=5 (91.52)
2CDT | N/A | k=2 (95.91) | k=3 (52.91)
2CHT | N/A | k=2 (89.63) | k=3 (77.33)
4CE1CF | k=4 (78.96) | k=5 (93.90) | k=6 (94.66)
4CR | k=3 (74.88) | k=4 (99.98) | k=5 (99.98)
4CRE-V2 | k=3 (25.13) | k=4 (92.30) | k=5 (22.78)
FG_2C_2D | k=3 (68.91) | k=4 (95.50) | k=5 (95.44)
GEARS_2C_2D | N/A | k=2 (95.82) | k=3 (87.99)
MG_2C_2D | k=3 (65.32) | k=4 (93.20) | k=5 (92.07)
UG_2C_2D | N/A | k=2 (95.71) | k=3 (56.28)
UG_2C_3D | N/A | k=2 (95.20) | k=3 (91.46)
UG_2C_5D | N/A | k=2 (92.12) | k=3 (88.03)
keystroke | k=9 (68.62) | k=10 (87.21) | k=11 (81.56)

TABLE VIII: Accuracy with three different values of σ (LEVEL_IW)

DATASETS | Lowest σ (Accuracy) | Middle σ (Accuracy) | Highest σ (Accuracy)
1CDT | 0.2 (99.91) | 1 (99.91) | 2 (99.92)
1CHT | 0.2 (99.40) | 1 (99.42) | 2 (99.51)
1CSurr | 1 (91.30) | 1.5 (90.00) | 2 (87.79)
2CDT | 0.2 (58.32) | 0.5 (50.32) | 1 (50.48)
2CHT | 0.2 (50.10) | 0.5 (50.89) | 1 (52.15)
4CE1CF | 0.2 (97.74) | 0.5 (97.12) | 1.5 (92.40)
4CR | 0.2 (99.99) | 1 (99.99) | 2 (99.99)
4CRE-V2 | 0.2 (20.96) | 0.5 (20.84) | 1 (24.10)
FG_2C_2D | 0.2 (95.71) | 0.5 (86.41) | 1 (94.28)
GEARS_2C_2D | 0.2 (97.73) | 1 (95.28) | 2 (95.36)
MG_2C_2D | 0.2 (78.03) | 0.5 (78.21) | 1.2 (85.44)
UG_2C_2D | 0.2 (70.61) | 0.5 (71.81) | 1 (74.33)
UG_2C_3D | 0.1 (61.21) | 1 (64.30) | 2 (64.68)
UG_2C_5D | 0.5 (77.67) | 1 (80.07) | 1.5 (80.17)
keystroke | 0.5 (88.12) | 1 (90.56) | 2 (89.43)


VII. SUMMARY OF FUTURE WORK


Further work is needed to generate or acquire more challenging
datasets, as most algorithms perform similarly on the current
synthetic benchmark datasets. Currently, there is a lack of
datasets that contain abruptly changing distributions, datasets
with recurring concepts or more severe class imbalance, datasets
with substantial feature or class noise, datasets with a
significant number of outliers, datasets with very little or almost
no shared support, and high dimensional datasets, to name a few.

We already know from the analyses in this work that the algorithms
described here will not work in all of the above-mentioned
scenarios, such as abruptly changing distributions or severe class
imbalance. Often in science, however, it is a challenging dataset,
or a collection of datasets, that provides the motivation for the
development of specialized algorithms within a specific discipline.
Additionally, future work is needed to provide the machine learning
community with an algorithm that performs well with respect to
classification accuracy, computational complexity and parameter
sensitivity, and that is also able to handle the challenging
datasets mentioned above in initially labeled nonstationary
environments.